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Abstract 

Background: The aim of a genome-wide association study (GWAS) is to isolate DNA markers for variants affecting 
phenotypes of interest. This is constrained by the fact that the number of markers often far exceeds the number of 
samples. Compressed sensing (CS) is a body of theory regarding signal recovery when the number of predictor 
variables (i.e., genotyped markers) exceeds the sample size. Its applicability to GWAS has not been investigated. 

Results: Using CS theory, we show that all markers with nonzero coefficients can be identified (selected) using an 
efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability 
equal to one {h 2 = 1), there is a sharp phase transition from poor performance to complete selection as the sample 
size is increased. For heritability below one, complete selection still occurs, but the transition is smoothed. We find 
for h 2 ~ 0.5 that a sample size of approximately thirty times the number of markers with nonzero coefficients is 
sufficient for full selection. This boundary is only weakly dependent on the number of genotyped markers. 

Conclusion: Practical measures of signal recovery are robust to linkage disequilibrium between a true causal variant 
and markers residing in the same genomic region. Given a limited sample size, it is possible to discover a phase 
transition by increasing the penalization; in this case a subset of the support may be recovered. Applying this 
approach to the GWAS analysis of height, we show that 70-100% of the selected markers are strongly correlated 
with height-associated markers identified by the GIANT Consortium. 

Keywords: GWAS, Genomic selection, Compressed sensing, Lasso, Underdetermined system, Sparsity, 
Phase transition 



Background 

The search for genetic variants associated with a given 
phenotype in a genome-wide association study (GWAS) 
is a classic example of what has been called a p » n 
problem, where n is the sample size (number of subjects) 
and p is the number of predictor variables (genotyped 
markers) [1]. Estimating the partial regression coeffi- 
cients of the predictor variables by ordinary least 
squares (OLS) requires that the sample size exceed the 
number of coefficients, which in the GWAS context, 
may be of order 10 or even 10 6 . The difficulty of as- 
sembling such large samples has been one obstacle 
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hindering the simultaneous estimation of all regression 
coefficients advocated by some authors [2-4]. 

The typical procedure in GWAS is to estimate each 
coefficient by OLS independently and retain those meet- 
ing a strict threshold; this approach is sometimes called 
marginal regression (MR) [5]. Although the implemen- 
tation of MR in GWAS has led to an avalanche of dis- 
coveries [6], it is uncertain whether it will be optimal 
as datasets continue to increase in size. Many genetic 
markers associated with a trait are likely to be missed 
because they do not pass the chosen significance 
threshold [7]. 

Unlike MR, which directly estimates whether each 
coefficient is nonzero, an Li-penalization algorithm, 
such as the lasso, effectively translates the estimates 
toward the origin where many are truncated out of the 
model [8]. If the number of variants associated with a typ- 
ical complex trait is indeed far fewer than the total num- 
ber of polymorphic sites [9-11], then it is reasonable to 



© 2014 Vattikuti et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative 
Commons Attribution License (http://creativecommons.Org/licenses/by/2.0), which permits unrestricted use, distribution, and 
reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain 
Dedication waiver (http://creativecommons.Org/publicdomain/zero/1.0/) applies to the data made available in this article, 
unless otherwise stated. 



Vattikuti et al. GigaScience 2014, 3:10 
http://www.gigasciencejournal.eom/content/3/1/10 



Page 2 of 1 7 



believe that L x penalization will at least be competitive 
with MR. Methods relying on the assumption of sparsity 
(few nonzero coefficients relative to sample size) have in 
fact been adopted by workers in the field of genomic 
selection (GS), which uses genetic information to guide 
the artificial selection of livestock and crops [12-15]. 
Note that the aim of GS (phenotypic prediction) is 
somewhat distinct from that of GWAS (the identifica- 
tion of markers tagging causal variants). The lasso is 
one of the methods studied by GS investigators [16,17], 
although Bayesian methods that regularize the coeffi- 
cients with strong priors tend to be favored [18,19]. 

In this paper we show that theoretical results from the 
field of compressed sensing (CS) supply a rigorous quan- 
titative framework for the application of regularization 
methods to GWAS. In particular, CS theory provides 
a mathematical justification for the use of L\ -penalized 
regression to recover sparse vectors of coefficients and 
highlights the difference between selection of the markers 
with nonzero coefficients and the fitting of the precise 
coefficient values. CS theory also addresses the robust- 
ness of Li algorithms to the distribution of nonzero 
coefficient magnitudes. 

Besides supplying a rule of thumb for the sample size 
sufficing to select the markers with true nonzero coef- 
ficients, CS gives an independent quantitative criter- 
ion for determining whether a given dataset has, in 
fact, attained that sample size. Whereas biological 
assumptions regarding the number of nonzeros do 
enter into the rule of thumb about sample size, these 
assumptions need not hold for the use of L\ penaliza- 
tion to be justified; this is because the returned results 
themselves inform the investigator whether the as- 
sumptions are met. 

We emphasize that CS is not a method per se, but 
may be considered a general theory of regression that 
takes into account model complexity (sparsity). The 
theory is still valid in the classical regression domain of 
n > p but establishes conditions for when full recovery 
of nonzero coefficients is still possible when n < p [20-22] . 
Our work therefore should not be directly compared to 
recent literature proposing and evaluating GS methods 
[18,19]. Rather, our goal is to elucidate properties of 
well-known methods, already in use by GWAS and GS 
researchers, whose mathematical attributes and empir- 
ical prospects may be insufficiently appreciated. 

Using more than 12,000 subjects from the Athero- 
sclerosis Risk in Communities Study (ARIC) European 
American and Gene-Environment Association Studies 
(GENEVA) cohorts and nearly 700,000 single-nucleotide 
polymorphisms (SNPs), we show that the matrix of 
genotypes acquired in GWAS obeys properties suitable 
for the application of CS theory. In particular, a given sam- 
ple size determines the maximum number of nonzeros 



that will be fully selected using an /^-penalization regres- 
sion algorithm. If the sample size is too small, then the 
complete set of nonzeros will not be selected. The transi- 
tion between poor and complete selection is sharp in the 
noiseless case (narrow-sense heritability equal to one). It is 
smoothed in the presence of noise (heritability less than 
one), but still fully detectable. Consistent with CS theory, 
we find in cases with realistic residual noise that the min- 
imal sample size for full recovery is primarily determined 
by the number of nonzeros, depends very weakly on the 
number of genotyped markers [22-24], and is robust to 
the distribution of coefficient magnitudes [25]. 

Theory of compressed sensing 

The linear model of quantitative genetics is 

y = Ax + e (1) 

Where y e R n is the vector of phenotypes, A e R n * p is 
the matrix of standardized genotypes, xeR^ is the vec- 
tor of partial regression coefficients, and eeR w is the 
vector of residuals. In the CS literature, A is often called 
the sensing or measurement matrix. Standardizing A 
does not affect the results and makes it simpler to utilize 
CS theory. We suppose that x contains s nonzero coeffi- 
cients ("nonzeros") whose indices we wish to know. 

The phase transition to complete selection is best 
quantified with two ratios (p, 8), where p = sin is a 
measure of the sparsity of nonzeros with respect to the 
sample size and 8 = nip is a measure of the undersam- 
pling. If we plot 8 on the abscissa ( #-axis) and p on the 
ordinate (y-axis), we have a phase plane on the square 
(0, 1) x (0, 1), where each point represents a possible 
GWAS situation (sample size, number of genotyped 
markers, number of true nonzeros). The performance 
of any given method can be assessed by evaluating a 
measure of recovery quality at each point of the plane. 
For an arbitrary ^-vector x, we use the following nota- 
tion for the Li and L 2 norms: 

|| x 11^=^1 xt | and || x || i2 = Jj^x] 

i=l V i=l 

Our results rely on two lines of research in the field of 
CS, which we summarize as two propositions. 

Proposition 1 [20,24,26,27] Suppose that the entries of 
the sensing matrix A are drawn from independent normal 
distributions and e is the zero vector (noiseless case). Then 
the p- 8 plane is partitioned by a curve p = p Li (8) into 
two phases. Below the curve the solution of min^ || x \\ L 
subject to Ax = y leads to x = x with probability con- 
verging to one as n,p, s — > oo in such a way that p and 8 
remain constant. Above the curve x*x with similarly 
high probability. 
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The function p Ll (S) can be analytically calculated [26]. 
Although Figure 1A presents some of our empirical re- 
sults, which we will discuss below, it can be taken as an 
illustration of the meaning of Proposition 1. The color 
scale represents the goodness of recovery, and the black 
curve is the graph of p L {8). It can be seen that increas- 
ing the sample size relative to s (decreasing p) leads to a 
sharp transition from poor to good recovery for 8 < < 1 
(i.e. n<<p). In other words, despite the fact that solving 
for x in Ax = y is strictly speaking underdetermined 
given n < p, minimizing || x || i x subject to the system of 
equations still yields recovery of x with high probability 
if n is sufficiently large relative to s. 

Most phenotypes do not have a heritability of one and 
are therefore, not noiseless, but CS theory shows that se- 
lection is still possible in this situation. Before stating 



the relevant CS result, we need to define two quantities 
characterizing the genotype matrix A. 

Definition 1 [22] The matrix A satisfies isotropy if the 
expectation value of AA is equal to the identity matrix. 

In the context of GWAS, a matrix of gene counts is 
isotropic if all markers are in linkage equilibrium (LE). 

Definition 2 [22] The coherence of the matrix A is the 
smallest number y such that, for each row a of the 
matrix, 



max | a f | <y 

\<t<p 



Thus, a matrix of genotypes is reasonably incoherent if 
the magnitudes of the matrix elements do not differ 
greatly from each other. In the GWAS context, A will be 
reasonably incoherent if all markers with very low minor 



0.8 



0.6 



0.4 



0.2 



c 1 



LU 



/ 

t □ 



0.5 

6 




I 



0.8 



0.6 



0.4 



0.2 



0.2 0.4 0.6 0.8 1 

6 



□ 



m 



a • • 



0.03 



0.4 



: jp. (A) Color corresponds to the 



Figure 1 Error in the p - 6 plane for a measurement matrix of random genomic SNPs (p = ^ and 6 - 

normalized error (NE) of the coefficients ^~jj^ 2 - The black curve is the expected phase boundary between poor and good recovery from [26]. The 
number of SNPs, p, was fixed at 8,027. The heritability was set to one (noiseless case). The circles correspond to the points (p = 0.08, 6 = 0.1 9) 
(white) and (p = 0.1 25, 6 = 0.125) (red) discussed in Measures of selection. (B) Same as panel (A), except that the heritability was set to 0.5 (noisy 
case). The white circle corresponds to the point (p = 0.025, 6 = 0.625) discussed in Measures of selection. (C) NE versus p for fixed n = 4,000 and 
p= 8,027 (blue corresponds to h 2 = 1, red to h 2 = 0.5). The square markers indicate recovery quality evaluated at a few data points using the lasso 
algorithm with 10-fold cross-validation written by MATLAB. 
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allele frequency (MAF) are pruned, since A is standard- 
ized and the standard deviation scales with MAF. 
We can now state 

Proposition 2 [22] Suppose that the sensing matrix A 
is isotropic with coherence y If n> C y slogp for a con- 
stant C then the solution of the problem 

min[||y -Ax|£ + A|| x \\ Li ] 

with a suitable choice ofX obeys 

II* " x H^-^f 5 polylogfc) 

where o\ is the variance of the residuals in e. 

Two features of Proposition 2 are worth noting. First, 
no strong restrictions on x are required. Second, the crit- 
ical threshold value of n depends linearly on 5, but only 
logarithmically on p. For n larger than the critical value, 
the deviations of the estimated coefficients from the true 
values will follow the expected OLS scaling of l/y/n. 

These results are more powerful than they might seem 
from the restrictive hypotheses required for brief formu- 
lations. For example, it has been shown that a curve 
similar to that in Proposition 1 also demarcates a phase 
transition in the case of e * 0 — although, as might be 
expected from a comparison of Propositions 1 and 2, 
with large residual noise the transition is to a regime of 
gradual improvement with n rather than to instantan- 
eous recovery [24,28]. A remarkable feature of this grad- 
ual improvement, however, should be noted. Proposition 
2 states that the scaling of the total fitting error in the 
favorable regime is within a polylogarithmic factor of 
what would have been achieved if the identities of the s 
nonzeros had been revealed in advance by an oracle. 
This result implies that perfect selection of nonzeros can 
occur before the magnitudes of the coefficients are well fit. 
Even if the residual noise is substantial enough to prevent 
the sharp transition from large to negligible fitting error 
evident in Figure 1A, the total magnitude of the error in 
the favorable phase is little larger than what would be 
expected given perfect selection of the nonzeros. 

Recent work has also generalized the sensing matrix, 
A, in Proposition 1 to several non-normal distributions 
(although not to genotype matrices per se) [27,29]. Fur- 
thermore, the form of Proposition 2 also holds under a 
weaker form of isotropy that allows the expectation of 
A'A to differ from the identity matrix by a small quantity 
(see [22] for the specification of the matrix norm). The 
latter generalization is promising because the covariance 
matrix in GWAS deviates toward block-diagonality as a 
result of linkage disequilibrium (LD) among spatially 
proximate variants. 



Whereas the penalization parameter \ in Proposition 2 
is often determined empirically through cross-validation, 
CS places a theoretical lower bound on its value that is 
based on the magnitude of the noise [22] (referred here 
as X min or A). A special feature of the GWAS context is 
that an estimate of the residual variance can be ob- 
tained from the genomic-relatedness method [7,30-32], 
thereby enabling the substitution of a theoretical noise- 
dependent bound for empirical cross-validation. Such 
noise-dependent bounds appear in other selection the- 
ories, including MR, and thus are not specific to CS 
[5,33]. As noted by [33], such bounds tend to be con- 
servative. Here, we show that the CS noise-dependent 
bound demonstrates good selection properties. A data- 
specific method, such as cross-validation may exhibit 
slightly better properties, but is computationally more 
expensive. 

Given this body of CS theory, a number of questions 
regarding the use of L x -penalized regression in GWAS 
naturally arise: 

1. Does the matrix of genotypes A in the GWAS 
setting fall into the class of matrices exhibiting the 
CS phase transition across the curve p Ll (S), as 
described by Proposition 1? 

2. Since large residual noise is typical, we must also 
ask: is A sufficiently isotropic and incoherent to 
make the regime of good performance described by 
Proposition 2 practically attainable? Since log p 
slowly varies over the relevant range of p we can 
absorb y and log p into the constant factor and 
phrase the question more provocatively: given that 
n > Cs is required for good recovery, what is C? 

3. In practice, a measure of recovery relying on the 
unknown x, such as a function of ||x - x|| l2 , cannot 
be used. Is there a measure of recovery, then, that 
depends solely on observables? 

The aim of the present work is to answer these three 
questions. 

Data description 

All participants gave informed consent. All studies 
were approved by their appropriate Research Ethics 
Committees. 

We used the ARIC and GENEVA European American 
cohort. The datasets were obtained from dbGaP through 
dbGaP accession numbers [ARIC:phs000090] and [GEN- 
EVA:phs000091] [34]. The ARIC population consists of 
a large sample of unrelated individuals and some fam- 
ilies. The population was recruited in 1987 from four 
centers across the United States: Forsyth County, North 
Carolina; Jackson, Mississippi; Minneapolis, Minnesota; 
and Washington County, Maryland. 
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The ARIC subjects were genotyped with the Affyme- 
trix Human SNP Array 6.0. We selected biallelic auto- 
somal markers based on a Hardy- Weinberg equilibrium 
tolerance of P < 10" 3 . Preprocessing was performed with 
PLINK 2 [35,36]. 

The datasets were merged to create a SNP genotype 
matrix (A) consisting of 12,464 subjects and 693,385 
SNPs. SNPs were coded by their minor allele, resulting 
in values of 0, 1, or 2. Each column of A was standard- 
ized to have zero mean and unit variance. Missing geno- 
types were replaced with the mean (i.e., zero) after 
standardization. We compared results for the phase 
transition for a limited number of cases when the miss- 
ing genotypes were imputed based on sampling from a 
Binomial distribution and the respective minor allele fre- 
quency. We found no difference between the imputation 
methods for our datasets. 

We simulated phenotypes according to Equation 1, 
rescaling each term to leave the phenotypic variance 
equal to unity and the variance of the breeding values 
in Ax to match the target narrow-sense heritability /z 2 , 
which is the proportion of phenotypic variance due to 
additive genetic factors. For standardized phenotypes, 
h 2 is equivalent to the additive genetic variance, which 
is defined to equal one in the noiseless case. We chose 
h 2 = 0.5 to represent the noisy case because many 
human traits show a SNP-based heritability close to 
this value [7,30,37]. 

The magnitudes of the s nonzeros in x were drawn 
from either the set {-1, 1} or hyperexponential distribu- 
tions. We defined two hyperexponential distributions 
(Hyperexponential 1 and 2) and each was generated by 
summing two exponentials with the same amplitude, but 
different decay constants. The pair of decay constants 
for Hyperexponential 1 were 0.055 and p, and that of 
Hyperexponential 2 were 0.25 and p. The coefficients 
were then truncated to keep only the top 5 nonzero coef- 
ficients, the rest were made zero, and 50% of the non- 
zeros had negative signs. The hyperexponential form 
was motivated by [38], but the decay constants were 
arbitrarily chosen. For all coefficient ensembles, the non- 
zeros were randomly distributed among the SNPs. When 
examining the dependence of an outcome on n, p, and 5 
the set p was either chosen randomly across the genome 
without replacement or restricted to all chromosome 22 
SNPs, and n and 5 were randomly sampled without re- 
placement. A single set of SNPs was used for all analyses 
of the genomic random p set. 

We also considered a real phenotype (height) rather 
than a simulated one, using 12,454 subjects with mea- 
surements of height adjusted for sex. We examined dif- 
ferent values of n and fixed p by always using all 
markers in our dataset. A called nonzero was counted as 
a true positive in the numerator of our "adjusted positive 



predictive value" (to be defined later) if the marker was a 
member of a proxy set based on height-associated SNPs 
discovered by the GIANT Consortium [39]. The set was 
generated using the BROAD SNAP database [40]. We 
based our proxy criterion on basepair distance rather 
than LD, as we found the correlations between SNPs in 
our dataset to be larger in magnitude than those re- 
corded in the SNAP database. We generated a proxy list 
based on a maximum basepair distance of 500 kb, which 
was the maximum distance that could be queried. 

Analysis 

Phase transition to complete selection 

We first studied the case of independent markers to gain 
insight into the more realistic case of LD among spatially 
proximate markers [17,41]. In the noiseless case (e = 0), 
it has been proven that there is a universal phase transi- 
tion boundary between poor and complete selection in 
the p - S plane (Proposition 1) [20,24,26,27]. The exist- 
ence of this boundary is largely independent of the 
explicit values of 5, n, and p for a large class of sensing 
matrices, including sensing matrices generated by the 
multivariate normal distribution. However, the transi- 
tion boundary does depend, on certain properties of 
the distribution describing the coefficients. For ex- 
ample, the boundary can depend critically on whether 
the coefficients are all positive or can have either sign, 
although the particular form of the distribution within 
either of these two broad classes is less important. 
Genetic applications typically have real-valued coeffi- 
cients, which are in the same class (i.e., in terms of 
phase transition properties) as coefficients drawn from 
the set {-1, 1} [25,42], which we used in the majority 
of our simulations. We also studied selection perform- 
ance when the coefficients are hyperexponentially dis- 
tributed (see Data Description). 

The phase transition can be explored using multiple 
measures of selection quality. Figure 1A shows the 
normalized error (NE) (Equation 5) of the coefficient 
estimates returned by the Li-penalized regression algo- 
rithm in our study of a simulated phenotype and a 
random selection of SNPs ascertained in a real GWAS 
for the noiseless case. The boundary between poor and 
good performance, as evidenced by this measure, was 
well approximated by the theoretically derived curve 
[26], confirming that a matrix of independent SNPs 
ascertained in GWAS qualifies as a CS sensing matrix. 

The noiseless case corresponds to a trait with a per- 
fect narrow-sense heritability (h 2 = 1). Although there 
are some phenotypes that approach this ideal situation, 
it is important to consider the more typical situation of 
h 2 < 1. Fi gure IB shows how the NE varied in the pres- 
ence of a noise level corresponding to h 2 = 0.5 (which is 
roughly the SNP-based heritability of height [7,30]). 



Vattikuti et al. GigaScience 2014, 3:10 
http://www.gigasciencejournal.eom/content/3/1/10 



Page 6 of 1 7 



We can see that the transition boundary was smoothed 
and effectively shifted downward. 

In the noisy case, the transition boundary was less 
dependent on S than in the noiseless case. Note that in 
Figure 1A-B the noise variance is fixed by h 2 , but p and 
S are both functions of the sample size. Fixing p and tra- 
versing the phase plane horizontally can be interpreted 
as using a sample of size n to study a particular pheno- 
type with s nonzeros, changing the number of genotyped 
markers in successive assays; Figure IB shows that in the 
noisy case an order-of-magnitude change in p had a neg- 
ligible impact on the quality of selection. 

Given this insensitivity to 8, it is instructive to increase 
the resolution with which the phase transition can be 
studied by fixing S and then comparing the h 2 = 1 and 
h 2 = 0.5 cases. Figure 1C shows that the NE approached 
its asymptote beyond the theoretical phase transition in 
both cases. Moreover, the asymptote appeared to be 
greater than zero in the noiseless case. This behavior 
may suggest that the noise-dependent A min prescribed by 
CS theory is suboptimal when noise is in fact absent; 
although the closeness of the theoretical and empirical 
phase boundaries implies that the deviation from opti- 
mality is mild. The transition was not altered in the 
noiseless case when A min was estimated using cross- 
validation, although there was some improvement in the 
noisy case. A 10-fold cross-validation increased the com- 
putational time by 10 to 100-fold. The similar quality of 
selection achieved by the theoretical A min and the use of 
cross-validation supports the theoretical estimate. 

In the noiseless case, when using a criterion of NE < 0.5, 
the phase transition to vanishing NE began at p ~ 0.4. 
In the noisy case of h 2 = 0.5, the phase transition began 
at p ~ 0.03 (n ~ 30s). As expected, the sample size for a 
given number of nonzero coefficients must be larger in 
the presence of noise. 

Measures of selection 

We next examined whether nonzeros were being cor- 
rectly selected despite a nonzero NE by considering 
additional measures of selection: 

1. The false positive rate (FPR), the fraction of true 
zero-valued coefficients that are falsely identified as 
nonzero. 

2. The positive predictive value (PPV), the number of 
correctly selected true nonzeros divided by the total 
number of nonzeros returned by the selection 
algorithm. 1 - PPV equals the false discovery rate 
(FDR). 

3. The median of the P-values obtained when 
regressing the phenotype on each of the Li-selected 
markers in turn (pt P _ VSL \ ue ). Each such P-value is the 
standard two-tailed probability from the t test of 



the null hypothesis that a univariate regression 
coefficient is equal to zero. The previous measures 
of recovery— NE, FPR, PPV— cannot be computed 
in realistic applications because they depend on the 
unknown x, and thus it is of interest to examine 
whether an observable quantity such as ftp -value 
also undergoes a phase transition at the same 
critical sample size. 

We hypothesized that a measure of the P-value distri- 
bution of the putative nonzero set may reflect the phase 
transition since the distribution of P-values of normally 
distributed random variables is uniform and is the basis 
of false discovery approaches for the multiple compari- 
sons problem [43]. 

We now turn to the behavior of these performance 
metrics as a function of sample size. In the noiseless case 
(Figure 2A-B), the NE showed a phase transition at n ~ 
1,000, but the PPV, FPR and / w jP _ value converged around 
n = 1,500. Since we fixed s to be 125, the location of the 
transition boundary with respect to the NE at the point 
(p = 0.125, £ = 0.125) was consistent with Figure 1A. Also 
shown is the point {p = 0.08, £ = 0.19), where the PPV, 
FPR, and {t P _ value converged. As the noise was increased 
(Figure 2C), the NE declined less sharply with increasing 
n, as expected from Figure 1. In contrast and shown in 
Figure 2D, the other measures (particularly the PPV&nd 
ftp -value) neared their asymptotic values even in the pres- 
ence of noise. The transitions of FPR, PPV, and ftp -value 
from poor to good performance were not smoothed by 
noise to the same extent as the transition of the NE. 

The greater robustness of the FPR, PPV and ftp -value 
against residual variance relative to the NE shows that 
accurate selection of nonzeros can occur well before the 
precise fitting of their coefficient magnitudes. The fact 
that the observable quantity ^_ va i ue exhibits this 
robustness is particularly important; a steep decline in 
ftp -value across subsamples of increasing size drawn 
from a given dataset demonstrates a transition to good 
recovery and implies that the full dataset has sufficient 
power for accurate identification. This is an empirical 
finding that deserves further investigation. 

For h 2 - 0.5 and across all measures of performance 
other than the NE, the transition appeared to be around 
n = 5,000. Given s = 125 and p = 8,027, this corresponds 
to (p = 0.025, £=0.625), which is circled in Figure IB. 
This estimate of the critical p is consistent with our pre- 
vious estimate when S was fixed at 0.5, supporting the 
weak dependence on p. 

Quality of selection in the presence of LD 

We have shown that randomly sampled SNPs from a 
GWAS of Europeans have the properties of a compressed 
sensor. This was expected, given that randomly sampled 
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Figure 2 Measures of selection as a function of sample size for the measurement matrix of random genomic SNPs. Fixing s = 125 and 
p= 8,027, we measured the selection of true nonzero coefficients according to four metrics for h 2 = 1 (A-B) and h 2 = 0.5 (C-D). Shown in (A-C) is 
the normalized error of the coefficients (NE). Shown in (B-D) are the positive predictive value {PPV, blue dots), false positive rate (FPR, green dots), 
and median P -value (^ P _ va | ue , green asterisks). The point n= 1,000 corresponds to (p = 0.1 25, 6 = 0.125) and n = 5,000 to (p = 0.025, 6 = 0.625) 
noted in Figure 1A and B respectively. 



markers will be mostly uncorrelated and therefore closely 
estimate an isotropic matrix. 

We next consider a genotype matrix characterized by 
LD. To do this, while still being able to evaluate recovery 
at all points of the p-S plane, we considered all geno- 
typed markers only on chromosome 22. Almost all of 
these markers were in LD with a few other markers, and 
the markers within each correlated group tended to be 
spatially contiguous (Figure 3C). As shown in Figure 3 A 
and B, the phase transition boundary with respect to NE 
was shifted to lower values of p and was less sensitive to 
(!) as in Figure IB. 

Although the phase transition from large to small NE 
appeared to be affected adversely by LD (at least in the 
noiseless case as shown in Figure 3A), the selection mea- 
sures were less affected, as seen by comparing Figure 4 
calculated using the intact chromosome 22 with Figure 2 
using markers drawn at random from across the gen- 
ome. Regardless of LD, the transition from poor to 
good values of ^_ value occurred at nearly the same 



sample size (about 30 times the number of nonzeros for 
h 2 = 0.5). The PPFand FPR saturated at worse asymptotic 
values in the noiseless case. In the noisy case, the PPV 
was also lower; perhaps surprisingly, the FPR actually 
increased with sample size. 

The relatively poor performance of the PPV and FPR 
in the case of LD is somewhat misleading. For example, 
an "off-by-one" (nearby) nonzero called by Z^-penalized 
regression will not count toward the numerator of the 
PPV, even if it is in extremely strong LD with a true 
nonzero. At the same time, such a near miss does count 
toward the numerator of the FPR This standard of re- 
covery quality seems overly stringent when we recall that 
picking out the causal variant from a GWAS "hit" region 
containing multiple marker SNPs in LD continues to be 
a challenge for the standard MR approach [44,45]. 

We examined whether the false positives called by the 
Li -penalized algorithm were indeed more likely to be 
in strong LD with the true nonzeros by computing the 
correlations between false positives and true nonzeros 
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Figure 3 Analysis of chromosome 22. (A) The p-6 plane for h 2 = 1. p was set to 8,915. Superimposed is the expected phase boundary when 
there is neither noise nor LD [26]. (B) The same as panel (A), except for h 2 = 0.5. (C) The matrix of correlations (positive roots of the r 2 LD measure) 
between genotyped SNPs on chromosome 22. Inset is a 100x 100 sample along the diagonal. 




0.01 



1000 2000 3000 
n 



0 2000 4000 6000 8000 
n 



C 

0.06 



o 0.04 



i 

Q_ 



0.02 



0 



D X10" 3 



i 

Q_ 



0 



40 



1000 2000 3000 0 2000 4000 6000 8000 
n n 

Figure 4 Measures of selection as a function of sample size for chromosome 22 (s = 125 and p = 8 7 915). The PPV (blue) and FPR (green) 
for h 2 = 1 (A) and h 2 = 0.5 (B). p P - value for h 2 = 1 (C) and h 2 = 0.5 (D). 



Vattikuti et al. GigaScience 2014, 3:10 
http://www.gigasciencejournal.eom/content/3/1/10 



Page 9 of 1 7 



for n = 5,000 and h 2 = 0.5. Figure 5 shows the histogram 
of the maximum correlation between each false positive 
and any of the true nonzeros. We compared this histo- 
gram to a realization from the null distribution, gener- 
ated by drawing markers at random from chromosome 
22 and finding each marker's largest correlation with any 
of the true nonzeros. The observed histogram featured 
many more large correlations than the realization from the 
null distribution, implying that the false positives showed a 
significant tendency to be in LD with true nonzeros. 

Figure 6 provides a visualization of the correlations 
among the false positives and true nonzeros. High corre- 
lations between false positives (upper left panel) and be- 
tween true nonzeros (lower right panel) lie near the 
main diagonal of self-correlations indicating spatial 
proximity of correlated SNPs as expected from the LD 
structure shown in Figure 3C. There are also high corre- 
lations between false positives and true nonzeros (upper 
right and lower left panels). These high correlations are 
also mostly confined to spatially proximate SNPs dem- 
onstrating a marked tendency for called false positives to 
occur close to one of the true nonzeros. 

Sensitivity to the distributions of coefficient magnitudes 
and MAF 

The appropriate prior on the distribution of coefficient 
magnitudes is often discussed [19]. However, CS theory 




0 0.2 0.4 0.6 0.8 1 



+r 

Figure 5 Distribution of maximum correlations between false 

positives and true nonzeros after the presumptive ju P _„ a/ue 

phase transition for chromosome 22. Histogram of the maximum 
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between a false positive and true nonzero for chromosome 22, 
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Figure 6 The matrix of correlations (positive roots of the r 2 LD 
measure) among false positives and true nonzeros after the 
presumptive [x P - vaiue phase transition for chromosome 22 

(s = 1 25, n = 5, 000, and h 2 = 0.5). SNP indices begin at the top left 
corner. The upper-left quadrant contains the correlations among 
false positives and the lower-right quadrant contains the correlations 
among the true nonzeros. Each element in the upper-right (lower-left) 
quadrant represents a correlation between a false positive and a true 
nonzero. Within both the false positive and the true nonzero sets, the 

markers are arranged in order of chromosomal map position. 

v y 

shows that the phase boundary for complete selection is 
relatively insensitive to this distribution. To test this pre- 
diction, we looked for evidence of performance degrad- 
ation upon replacing the discrete distribution of nonzero 
coefficients used thus far with a hyperexponential dis- 
tribution (a mixture of exponential distributions with 
different decay constants) (these are defined in Data 
Description and shown in Figure 7A). The hyperex- 
ponential distribution is a means of implementing an 
arguably more realistic ensemble of a few large coefficients 
followed by a tail of weaker values [38]. Figure 7B-C shows 
that, as predicted by theoretical CS results, for fixed h 2 
and chromosome 22, the normalized value con- 
verged to zero at the same sample size regardless of the 
ensemble. 

In the previous simulations, we drew the nonzeros at 
random from all genotyped markers, thus guaranteeing 
that the MAF spectra of the nonzeros and the entire 
genotyping chip would tend to coincide. Here, we also 
tested whether the MAF spectrum of nonzeros affects 
the selection phase boundary. It is known that two SNPs 
can be in strong LD only if they have similar MAFs 
[46,47] . We confirmed this by taking all pairs of markers 
on chromosome 22 and plotting the maximum positive 
root of the LD measure as a function of squared MAF 
difference (Figure 8A). Therefore, in order to isolate any 
effect of the MAF distribution among nonzeros not 
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Figure 7 Insensitivity of the selection phase boundary to the distribution of coefficient magnitudes (ensemble). (A) s = 1 25 coefficient 
magnitudes ("effect sizes") ordered from large to small for the Uniform (blue), Hyperexponential 1 (red), and Hyperexponential 2 (green) 
ensembles. (B) Chromosome 22 analysis using fj P _ va \ ue to measure selection (normalized by the maximum value) as a function of sample size for 
h 2 = 1 for the Uniform (blue) and Hyperexponential 1 (red) ensembles. (C) As in panel (B) except for h 2 = 0.5. Also shown is recovery for the 
Hyperexponential 2 ensemble (green). 



mediated by LD, we constructed a synthetic measure- 
ment matrix A with independent columns and the same 
MAF spectrum as chromosome 22. We then compared 
recovery when the nonzero coefficients were sampled 
from SNPs with MAF between 0.0045 and 0.015, or when 



they were sampled above MAF of 0.49. For this we used 
nonzeros from {-1, 1}. Figure 8B shows no difference in 
recovery between the conditions for h 2 = 0.5. This suggests 
that MAF alone is not a determinant of the phase transi- 
tion. Homogeneity in MAF among nonzeros may enrich 
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Figure 8 Insensitivity of the selection phase boundary to minor allele frequency (MAF) for chromosome 22. (A) The maximum positive 
root of the r 2 LD measure (+r) as a function of squared MAF difference. The maxima are estimated over bin lengths of 0.05 for SNPs in 
chromosome 22. (B) The median P -value ^ P _ value ) normalized by the maximum value as a function of sample size for s = 125 from {-1, 1} and 
h 2 = 0.5 for nonzero coefficients sampled from low (blue) or high (red) MAF SNPs on chromosome 22. 
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correlations as noted above. Such correlations would 
be expected to reduce the effective s and thus affect the 
phase boundary. 

Selection of SNPs associated with height 

Motivated by the results above, we examined whether 
the full sample size of 12,454 subjects was sufficient to 
achieve the phase transition from poor to good recovery 
of SNPs associated with a real phenotype (height). We 
considered the selection measures //p_ value and adjusted 
the positive predictive value (PPV); the latter extended 
true-positive status to any selected SNP within 500 kb of 
a SNP identified as a likely marker of a height-affecting 
variant in the GIANT Consortiums analysis of ~ 180,000 
unrelated individuals [39]. This extension is consistent 
with the rule of thumb designating a 1-Mb region as a 
"locus" for purposes of counting the number of GWAS 
"hits" [48]. The relative insensitivity of ^_ value to LD 
suggests that PPV rewards the identification of both 
true nonzeros and markers tagging nonzeros; we there- 
fore substituted PPV for PPV in an attempt to align the 
phase dynamics of our precision measure with those of 
ftp -value- Whether a selected marker fell within 500 kb 
of a GIANT-identified marker was determined by con- 
sulting the Broad Institutes SNAP database [40]. 



Figure 9 A shows that ^_ va i ue failed to approach zero, 
suggesting that that n = 12, 454 is not large enough to 
see a phase transition to the regime of good recovery. 
Given our empirical finding that p ~ 0.03 is required 
for h 2 ~ 0.5, this suggests that height is affected by at 
least 400 causal variants, a result consistent with the 
observation that the -250 known height-associated 
SNPs account for only a small proportion of this trait s 
additive genetic variance [48]. However, the null PPV" 
derived from randomly chosen SNPs was smaller than 
the observed PPV" (Figure 9A); this was consistent 
with the detection of some true signal. In other words, 
although no phase transition was evident, the recovery 
measure did improve with increased sample size. 

The penalization parameter A was set using CS theory 
to minimize NE error based on the expected noise-level 
from reported narrow sense heritability for height [7,30]. 
If A is set too low, then more false positives are expected; 
if A is set too high, then true nonzeros will be missed. 
According to CS theory, an Z^-penalized method can 
still select some of the largest coefficients from a non- 
uniform distribution of coefficient magnitudes even if 
complete recovery is out of reach [49]. We investigated 
whether it was possible to achieve a phase transition to 
low fi P _ va i U e and high PPV\ at the cost of recovering 
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Figure 9 Selection measures as a function of sample size in an analysis of real height data. (A) The adjusted positive predictive value 
{PPV*, blue solid dots) and median P -value iid P - va \ uei red) as a function of sample size using X based on h 2 = 0.5. Also shown is PPV* when the 
same number of SNPs are randomly selected rather than returned by the i } algorithm (blue unfilled dots). (B) As in (A) but setting A to a value 
appropriate for /i 2 = 0.01 . 
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only a small fraction of all true nonzeros, by increasing 
the penalty parameter A. More specifically, we set A to a 
higher value consistent with h 2 = 0.01 rather than 0.5. 
In this case, the L x algorithm returned 20 putative non- 
zeros rather than the original 403, and both /Wp_ value 
and PPV* exhibited better performance (Figure 9B). 
Compared to the less stringent X, PPV as a function of 
n was less smooth, but appeared to stabilize to a high 
recovery value after - 7000 subjects. Evidently, if the 
sample size does not suffice to capture the full herit- 
ability, setting the penalty parameter to a value appro- 
priate for a lower heritability can lead to a smaller set 
of selected markers characterized by good precision. 

Figure 10 illustrates the physical distances between the 
markers selected in our strict-A (assuming h 2 = 0.01) 
analysis and the markers identified by the GIANT Con- 
sortium. Of the 20 Li- selected markers, 14 were within 
500-kb of a GIANT-identified marker. However, the 
Li-selected markers defined to be false positives were 
still relatively close to GIANT-identified markers. This 



may indicate that the 500-kb criterion for declaring a 
true positive was too stringent; if so, then our stated 
PPV* of 0.7 can be regarded as a lower bound. As an 
informal comparison, Figure 10 also displays the results of 
a more standard MR-type GWAS analysis. For a P-value 
of 10" 8 and all 12,454 subjects, MR returned six SNPs, 
five of which were GIANT-identified markers, and four 
were exact matches with SNPs selected by our L x algo- 
rithm (Figure 10). With a P -value cutoff of 5 x 10" 8 
and all subjects, MR returned 13 markers, 10 of which 
were GIANT-identified, and 7 of which were identical 
to the L x -selected markers. 

The presence of a phase transition is not necessarily 
restricted to L x algorithms, but rather may represent a 
deeper phenomenon in signal recovery. Other methods 
may show a similar phase transition— although CS the- 
ory suggests that, among convex optimization methods, 
those within the L x class are closest to the optimal com- 
binatorial L 0 search. We conducted additional analyses 
to test whether a phase transition at a critical sample 




Figure 10 Map of SNPs associated with height, as identified by the GIANT Consortium meta-analysis, L n -penalized regression, and 
standard GWAS. Base-pair distance is given by angle, and chromosome endpoints are demarcated by dotted lines. Starting from 3 o'clock and 
going counterclockwise, the map sweeps through the chromosomes in numerical order. As a scale reference, the first sector represents 
chromosome 1 and is ~ 250 million base-pairs. The blue segments correspond to a 1 Mb window surrounding the height-associated SNPs 
discovered by GIANT. Note that some of these may overlap. The yellow segments represent L ] -selected SNPs that fell within 500 kb of a 
(blue) GIANT-identified nonzero; these met our criterion for being declared true positives. The red segments represent L ] -selected SNPs that 
did not fall within 500 kb of a GIANT-identified nonzero. Note that some yellow and red segments overlap given this figure's resolution. There 
are in total 20 yellow/red segments, representing L } -selected SNPs found using all 1 2,454 subjects. The white dots represent the locations of 
SNPs selected by MR at a P -value threshold of 1 0" 8 
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size could also be observed when our height data were 
analyzed using the MR approach commonly used in 
GWAS. In these simulations we varied the P -value 
threshold for genome-wide significance. As measures of 
selection are potentially subject to a phase transition, we 
examined the PPV* and the adjusted median P -value 
valued The latter measure was defined to be the me- 
dian P -value among those SNPs surviving the P -value 
cutoff, divided by the cutoff itself; the normalization was 
necessary to remove the dependence on the choice of 
cutoff. As shown in Figure 11, the P -value threshold 
10" 8 yielded very few selected SNPs, and in fact, none 
were returned at sample sizes smaller than approxi- 
mately 8,000. However, ^p_ value was mostly close to zero 
in the region of Figure 11B corresponding to n>8, 000 
and P - value < 10" 6 , suggesting that true nonzeros were 
being selected. This is confirmed by the fact that the PPV* 
typically exceeded 0.6 in this same region (Figure 11 A). For 
P -value thresholds less stringent than 10" 6 , signs of a phase 
transition at a critical sample size were still discernible. 

A search for a phase transition can be a useful ap- 
proach to determining the optimal P -value threshold in 
standard GWAS protocols employing MR. In addition to 
a priori assumptions regarding the likely number of true 



nonzeros and their coefficient magnitudes [38,50] and 
agreement between studies of different designs [51], 
GWAS investigators might rely on whether a measure such 
as /4_ value undergoes a clear phase transition as they take 
increasingly large subsamples of their data. A majority of 
markers surviving the most liberal significance threshold 
bounding the second phase are likely to be true positives. 

Discussion 

Our results with real European GWAS data and simu- 
lated vectors of regression coefficients demonstrate the 
accurate selection of those markers with nonzero coeffi- 
cients, consistent with CS sample size requirements (n) 
for a given sparsity (s) and total number of predictors 
(p). We found that the matrix of standardized genotypes 
exhibits the theoretical phase transition between poor 
and complete selection of nonzeros (Proposition 1). We 
also found, as for Gaussian random matrices in earlier 
studies, that the phase transition depends on the scaling 
ratios p = sin and 8 = nip [42]. 

We obtained results regarding the effect of noise (i.e., 
h 2 < 1) that are consistent with earlier empirical studies of 
random matrices and recently proven theorems [22,24,28] . 
Generally speaking, we show that the critical sample size 
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Figure 11 Measures of recovery using marginal regression (standard GWAS) as a function of sample size. All SNPs surviving the 
chosen - log 10 P- value threshold were selected. The recovery measures, computed over the selected SNPs, were (A) the adjusted positive predictive 
value (PPV*) and (B) the median P -value divided by the P -value cutoff. Highlighted in red is the cutoff we used for MR in Figure 10. 
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is determined mainly by the ratio of s to n and only weakly 
sensitive to p, particularly as noise increases. For example, 
if h 2 - 0.5, which is roughly the narrow-sense heritability 
of height and a number of other quantitative traits 
[7,30,37], we find that p should be less than approximately 
0.03 for recovery irrespective of S. There is no hope of 
recovering the complete vector of coefficients x above 
this threshold (i.e., smaller sample sizes). For example, 
if we have prior knowledge that 5 = 1, 200, then this 
means that the sample size should be no less than 
40,000 subjects. We find empirically that for h 2 ~ 0.5, 
n ~ 305 is sufficient for selection of the nonzeros. 

In real problems we cannot rely on measures of model 
recovery based on the unknown x. Hence, we introduced 
a new measure based on the median P -value of the 
Li -selected nonzeros, ^_ va i ue . We found that 

ftp - value 

provides a robust means of detecting the boundary be- 
tween poor and good recovery. Proposition 2 shows that 
the recovery error NE in the favorable phase scales with 
p and noise; however, we observed that the recovery 
measures FPR, PPFand /Wp_ value approached zero faster 
than the NE, confirming that accurate identification of 
nonzeros can occur well before precise estimation of 
their magnitudes. 

An Li -penalized regression algorithm is equivalent to 
linear regression with a Laplace prior distribution of 
coefficients, and in theory a Bayesian method invoking a 
prior distribution better matching the unknown true 
distribution of nonzero coefficients should outperform 
the lasso in effect estimation. However, it is by no means 
clear that the performance of L x penalization with 
respect to selection can be bettered. For example, the lasso 
and BayesB display rather similar performance properties 
[17]. However, both methods clearly outperformed ridge 
regression (a non- L x method), which exhibited no phase 
transition away from poor performance. Furthermore, it is 
usually accepted by GWAS researchers that knowledge 
of the markers with nonzero coefficients may be quite 
valuable, even if the actual magnitudes of the coeffi- 
cients are not well determined. Combining the advan- 
tages of different approaches by applying one of them 
to the L 1 -selected markers is a possibility. 

Perhaps contrary to intuition, but consistent with 
theoretical results for CS [25,42], we found that the 
phase transition to good recovery (at least as measured 
by fi P _ va i ue ) was insensitive to the distribution of coeffi- 
cient magnitudes. It is well known in CS that L x -penalized 
regression is nearly minimax optimal (minimizes the 
error of the worst case), and that the phase transition is 
robust to the distribution of coefficient magnitudes. In 
some cases a good prior may reduce the mean-square 
error and shift the location of the phase transition [52], 
However, simulations supporting this notion, were per- 
formed with a much higher signal-to-noise ratio (SNR) 



than hypothesized for realistic GWAS problems. The 
performance increase was attenuated as the SNR was 
decreased to levels still higher than usual in GWAS 
(10 dB or h 2 > 0.9 where SNR on the dB scale is given 

by 10- log 10 ^-^ ). These algorithms are currently being 

explored in lower-SNR regimes. We observed that 
cross-validation did slightly affect the phase transition 
boundary in the noisy case; nevertheless the theoretical 
penalization parameter proved to be a good rule of 
thumb for initial screening. Calculating the theoretical 
penalty depends on knowledge of h 2 , which may be esti- 
mated using the genomic-relatedness method [7,30-32] . 

Genomic selection methods have been criticized by 
researchers who doubt that the number of nonzeros (5) 
will typically be smaller than a practically attainable sam- 
ple size (n) [19]. The application of CS theory circumvents 
this problem because it allows the optimization method to 
self-determine whether or not the nonzero markers are 
sufficiently sparse compared to the sample size. No prior 
assumptions are required. Furthermore, there is evidence 
that a number of traits satisfy the sparsity assumption in 
humans, at least with respect to common variants contrib- 
uting to heritability [9-11]. 

CS theory does not provide performance guarantees in 
the presence of arbitrary correlations (LD) among pre- 
dictor variables: it must be verified empirically, as we 
have done. In agreement with previous results [17], we 
find that the phase transition, as measured by NE, is 
strongly affected by LD. However, according to our sim- 
ulations using all genotyped SNPs on chromosome 22, 
Li -penalized regression does select SNPs in close prox- 
imity to true nonzeros. The difficulty of fine-mapping 
an association signal to the actual causal variant is 
a limitation shared by all statistical gene-mapping ap- 
proaches — including marginal regression as implemented 
in standard GWAS — and thus should not be interpreted 
as a drawback of L x methods. 

We found that a sample size of 12,464 was not sufficient 
to achieve full recovery of the nonzeros with respect to 
height. However, the penalization parameter A is set by CS 
theory so as to minimize the NE based on the expected 
noise-level. In some situations it might be desirable to tol- 
erate a relatively large NE in order to achieve precise, but 
incomplete recovery (few false positives, many false nega- 
tives). By setting A to a strict value appropriate for a low- 
heritability trait (in effect, looking for a subset of markers 
that account for only a fraction of the total heritability, 
with consequently higher noise), we found that a phase 
transition to good recovery can be achieved with smaller 
sample sizes, at the cost of selecting a smaller number of 
markers and hence suffering many false negatives. 

One interesting feature of the recovery measure based 
on the median P -value (ftp- VSi iue) is that it seemed to 
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rise as the sample size was increased in the region of 
poor recovery and then fall after the sample size 
crossed the CS-determined phase transition boundary. 
This rise and then fall was very dramatic in our simula- 
tions (Figures 2 and 4) and also appeared in our analysis 
of height (Figure 9). This behavior may be a consequence 
of the fact that as the sample size is increased, A in the 
algorithm is decreased (see Methods). Hence, in the 
region of poor recovery, the relaxation of the penalty 
with increasing sample size may permit the selection 
of more SNPs and hence the inflation of the FPR and 
ftp -value- However, once the phase transition to good 
performance begins, the recovery measures begin their 
characteristic sharp decrease. This non-monotone be- 
havior accentuates the transition boundary and can be 
exploited to aid its detection. 

In summary, compressed sensing utilizes properties of 
high-dimensional systems that are surprising from the 
perspective of classical statistics. The regression problem 
faced by GWAS and GS is well-suited to such an 
approach, and we have shown that the matrix of SNP 
genotypes formed from European GWAS data is in fact 
a well-conditioned sensing matrix. Consequently, we 
have inferred the sample sizes required to achieve accur- 
ate model recovery and demonstrated a method for de- 
termining whether the minimal sample size has in fact 
been obtained. 

Methods 

/-! -penalized regression algorithm 

Li-penalized regression (e.g., lasso) minimizes the ob- 
jective function 

II y-y \\l 2 + II * K (2) 

where y is the estimated breeding value given by Ax. 
The setting of the penalization parameter A is described 
below. 

The algorithm was performed using pathwise coord- 
inate optimization and the soft-threshold rule [53]. 
Regression coefficients were sequentially updated with 

x,(A)«-S (xj(X) + X - $>y(y r y,), xj for j 

= 1,2, ...,/> (3) 

where 

S(z,A)=sign(z)(|z|-A)+ 

{z-A, if z > 0 and A < | z | , ( , 

z + A, if z < OandA < \z \ , ^ ' 

0, ifA>|z| 

We assumed convergence if the fractional change in 
the objective function given by Equation 2 was less 



than 10" . In addition, we performed lasso with a 
warm start [54], using a logarithmic descent of 100 
steps in A with A max = (^) II Ay \\ L ro . For A min we used 

(<r* E /n) || Ae \\ Loo) where o\ = \Jg\ + \ [22]. To esti- 
mate ||A ' ell^oo we created 1,000 sample vectors of e, 
each constructed with n i.i.d. normal elements with 
mean zero and variance one, and took the median 
across samples of HA'dl^oo. Estimates of (0^,0^) with 
respect to the variants assayed in a given study can be 
obtained using the genomic-relatedness method [7,30-32]. 
The algorithm can also accommodate any other covariates. 

Computations 

Simulations and analyses were performed using MATLAB 
2013 (The MathWorks Inc., Natick, Massachusetts) and 
PLINK 2 [35,36]. The L x -optimization algorithm was writ- 
ten in MATLAB and also a feature of PLINK 2. P -values 
were estimated using MATLABs regstats function and 
PLINK 2. Color-coded phase plane figures were generated 
by sampling the p - S plane and interpolating between points 
using MATLABs scatteredlnterpolant function. GWAS data 
were obtained from dbGaP as described in Data Descrip- 
tion. Analysis scripts are available from the GigaScience 
GigaDB repository and maintained on GitHub [55,56]. 

Statistics 

The normalized coefficient error (NE) is 



The false positive rate (FPR) is the fraction of true 
zero-valued coefficients that are falsely identified as non- 
zero. The positive predictive value (PPV) is the number 
of correctly selected true nonzeros divided by the total 
number of nonzeros returned by the selection algorithm. 
1 - PPV equals the false discovery rate (FDR). The ad- 
justed positive predictive value (PPV") is similar to the 
standard PPV, except that any selected nonzero coeffi- 
cient falling within 500 kb of a GIANT- identified marker 
is counted as a true positive [39]. 

The median of the P -values for the set of putative 
nonzeros (^p- va iue) is obtained by: 1) regressing the 
phenotype on each of the L x -selected markers in turn, 
2) estimating each P -value as the standard two-tailed 
probability from the t test of the null hypothesis that a 
univariate regression coefficient is equal to zero, and 3) 
taking the median over the independent tests. This 
procedure is independent of the selection algorithm 
and calculated after the L x -penalized algorithm has 
converged. The adjusted median P -value (/4_ va i ue ) is 
the median of the MR P -values falling below the 
significance threshold divided by the threshold itself. 
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The LD measure (r 2 ) is the squared estimate of the 
Pearsons product-moment correlation between the stan- 
dardized zero-mean, unit-variance SNPs. 

Analysis codes are archived in the GigaScience GigaDB 
repository and maintained on GitHub [55,56]. 

Availability of supporting data 

As noted above, the data sets supporting the results of this 
article are available through dbGaP accession numbers 
[ARIC:phs000090] and [GENEVA:phs000091], http://www. 
ncbi.nlm.nih.gov/gap [34]. Mock data sets supporting the 
results of this article are available in the GigaDB repository, 
doi:10.5524/100094 and http://gigadb.org/dataset/view/id/ 
100094/ [55]. 
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